Deep Learning: A Simple Example

  • Let’s get back to the Name Gender Classifier.

Prepare Data

import numpy as np
import nltk
import random
with open("../../../RepositoryData/data/_ENC2045_DATA/chinese_name_gender.txt", encoding="utf-8") as f:
    labeled_names = [l.replace('\n','').split(',') for l in f.readlines() if len(l.split(','))==2]
labeled_names = [(n, 1) if g == "男" else (n, 0) for n, g in labeled_names]
labeled_names[:10]
[('阿貝貝', 0),
 ('阿彬彬', 1),
 ('阿斌斌', 1),
 ('阿冰冰', 0),
 ('阿波波', 1),
 ('阿超超', 1),
 ('阿春兒', 0),
 ('阿達禮', 1),
 ('阿丹丹', 0),
 ('阿丹兒', 0)]
random.shuffle(labeled_names)

Train-Test Split

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(labeled_names, test_size = 0.2, random_state=42)
print(len(train_set), len(test_set))
732516 183129
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM, RNN, GRU
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import SpatialDropout1D
names = [n for (n, l) in train_set]
labels = [l for (n, l) in train_set] 
len(names)
732516
nltk.FreqDist(labels)
FreqDist({1: 475272, 0: 257244})

Tokenizer

  • By default, the token index 0 is reserved for the padding token.

  • If oov_token is specified, it defaults to index 1.

  • Specify num_words for the tokenizer to include only the top N most frequent words in the model.

  • The Tokenizer automatically removes punctuation.

  • The Tokenizer uses whitespace as the word delimiter.

  • To treat every character as a token, specify char_level=True.

tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(names)
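The indexing scheme described above can be sketched in plain Python. This is a toy mimic of the char-level behavior, not the Keras implementation; `build_char_index` and `encode` are hypothetical helper names for illustration:

```python
from collections import Counter

# Toy char-level indexer: index 0 is reserved for padding, index 1 for the
# OOV token, and the remaining indices are assigned by descending frequency.
def build_char_index(texts, oov_token="<OOV>"):
    counts = Counter(ch for t in texts for ch in t)
    word_index = {oov_token: 1}
    word_index.update({ch: i + 2 for i, (ch, _) in enumerate(counts.most_common())})
    return word_index

def encode(text, word_index):
    # Unseen characters map to the OOV index 1
    return [word_index.get(ch, 1) for ch in text]

idx = build_char_index(["abab", "abc"])
print(idx)                 # {'<OOV>': 1, 'a': 2, 'b': 3, 'c': 4}
print(encode("abz", idx))  # [2, 3, 1]
```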

Prepare Input and Output Tensors

  • As in feature-based machine learning, a computational model accepts only numeric values, so we need to convert the raw texts into numeric tensors for the neural network.

  • After creating the Tokenizer, we use it to perform text vectorization, i.e., to convert the texts into tensors.

  • In deep learning, words or characters are automatically converted into numeric representations.

  • In other words, the feature engineering step is fully automatic.

Two Ways of Text Vectorization

  • Texts to Sequences: integer-encode the tokens in each text and learn token embeddings

  • Texts to Matrix: one-hot encode each text (similar to the bag-of-words model)

Method 1: Text to Sequences

From Texts to Sequences

  • Convert texts to integer sequences

  • Pad each text to a uniform length

names_ints = tokenizer.texts_to_sequences(names)
print(names[:10])
print(names_ints[:10])
print(labels[:10])
['李照華', '宋朝輝', '諸葛偉', '林振杰', '石星星', '謝昕昕', '俞銀兒', '齊春輝', '林紫馨', '羅偉生']
[[2, 585, 10], [78, 250, 48], [918, 340, 18], [7, 95, 749], [197, 228, 228], [73, 641, 641], [330, 242, 458], [327, 28, 48], [7, 525, 542], [63, 18, 50]]
[1, 1, 1, 1, 1, 0, 0, 1, 0, 1]

Vocabulary

# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
Vocabulary Size: 2241
tokenizer.word_index
{'王': 1,
 '李': 2,
 '張': 3,
 '陳': 4,
 '劉': 5,
 '文': 6,
 '林': 7,
 '明': 8,
 '楊': 9,
 '華': 10,
 '黃': 11,
 '吳': 12,
 '金': 13,
 '曉': 14,
 '周': 15,
 '國': 16,
 '趙': 17,
 '偉': 18,
 '海': 19,
 '玉': 20,
 '志': 21,
 '徐': 22,
 '麗': 23,
 '建': 24,
 '紅': 25,
 '平': 26,
 '英': 27,
 '春': 28,
 '軍': 29,
 '朱': 30,
 '孫': 31,
 '龍': 32,
 '永': 33,
 '胡': 34,
 '德': 35,
 '榮': 36,
 '東': 37,
 '成': 38,
 '雲': 39,
 '芳': 40,
 '郭': 41,
 '鄭': 42,
 '馬': 43,
 '高': 44,
 '新': 45,
 '梅': 46,
 '何': 47,
 '輝': 48,
 '秀': 49,
 '生': 50,
 '玲': 51,
 '傑': 52,
 '世': 53,
 '俊': 54,
 '強': 55,
 '光': 56,
 '洪': 57,
 '江': 58,
 '豔': 59,
 '燕': 60,
 '慶': 61,
 '子': 62,
 '羅': 63,
 '蘭': 64,
 '峯': 65,
 '忠': 66,
 '宇': 67,
 '鳳': 68,
 '清': 69,
 '霞': 70,
 '美': 71,
 '祥': 72,
 '謝': 73,
 '興': 74,
 '立': 75,
 '萍': 76,
 '梁': 77,
 '宋': 78,
 '雪': 79,
 '良': 80,
 '家': 81,
 '福': 82,
 '葉': 83,
 '慧': 84,
 '許': 85,
 '娟': 86,
 '飛': 87,
 '佳': 88,
 '寶': 89,
 '學': 90,
 '安': 91,
 '亞': 92,
 '波': 93,
 '珍': 94,
 '振': 95,
 '鵬': 96,
 '敏': 97,
 '元': 98,
 '利': 99,
 '蔡': 100,
 '斌': 101,
 '勇': 102,
 '瑞': 103,
 '大': 104,
 '方': 105,
 '韓': 106,
 '正': 107,
 '唐': 108,
 '天': 109,
 '曹': 110,
 '宏': 111,
 '少': 112,
 '武': 113,
 '沈': 114,
 '民': 115,
 '田': 116,
 '鄧': 117,
 '亮': 118,
 '馮': 119,
 '程': 120,
 '濤': 121,
 '君': 122,
 '超': 123,
 '琴': 124,
 '蔣': 125,
 '潘': 126,
 '昌': 127,
 '曾': 128,
 '蘇': 129,
 '彭': 130,
 '董': 131,
 '長': 132,
 '肖': 133,
 '桂': 134,
 '餘': 135,
 '秋': 136,
 '勝': 137,
 '萬': 138,
 '中': 139,
 '於': 140,
 '淑': 141,
 '松': 142,
 '青': 143,
 '婷': 144,
 '靜': 145,
 '剛': 146,
 '丁': 147,
 '貴': 148,
 '袁': 149,
 '杜': 150,
 '呂': 151,
 '陽': 152,
 '芬': 153,
 '思': 154,
 '魏': 155,
 '澤': 156,
 '愛': 157,
 '廣': 158,
 '惠': 159,
 '任': 160,
 '鋒': 161,
 '山': 162,
 '一': 163,
 '義': 164,
 '姚': 165,
 '花': 166,
 '香': 167,
 '月': 168,
 '盧': 169,
 '全': 170,
 '仁': 171,
 '智': 172,
 '鍾': 173,
 '維': 174,
 '娜': 175,
 '友': 176,
 '雅': 177,
 '範': 178,
 '夏': 179,
 '富': 180,
 '汪': 181,
 '莉': 182,
 '康': 183,
 '崔': 184,
 '宗': 185,
 '遠': 186,
 '陸': 187,
 '姜': 188,
 '浩': 189,
 '樹': 190,
 '衛': 191,
 '廖': 192,
 '旭': 193,
 '彬': 194,
 '兵': 195,
 '夢': 196,
 '石': 197,
 '丹': 198,
 '繼': 199,
 '嘉': 200,
 '章': 201,
 '賢': 202,
 '雨': 203,
 '連': 204,
 '和': 205,
 '根': 206,
 '景': 207,
 '發': 208,
 '坤': 209,
 '孟': 210,
 '寧': 211,
 '譚': 212,
 '雷': 213,
 '才': 214,
 '蓮': 215,
 '琳': 216,
 '賈': 217,
 '啓': 218,
 '雄': 219,
 '順': 220,
 '潔': 221,
 '欣': 222,
 '健': 223,
 '傳': 224,
 '凱': 225,
 '錦': 226,
 '邱': 227,
 '星': 228,
 '白': 229,
 '翠': 230,
 '穎': 231,
 '素': 232,
 '付': 233,
 '侯': 234,
 '喜': 235,
 '鄒': 236,
 '羣': 237,
 '瓊': 238,
 '祖': 239,
 '彥': 240,
 '先': 241,
 '銀': 242,
 '吉': 243,
 '培': 244,
 '熊': 245,
 '顧': 246,
 '怡': 247,
 '耀': 248,
 '鑫': 249,
 '朝': 250,
 '菊': 251,
 '士': 252,
 '鴻': 253,
 '毛': 254,
 '水': 255,
 '戴': 256,
 '秦': 257,
 '劍': 258,
 '有': 259,
 '進': 260,
 '雯': 261,
 '克': 262,
 '尹': 263,
 '會': 264,
 '樂': 265,
 '漢': 266,
 '史': 267,
 '黎': 268,
 '紹': 269,
 '書': 270,
 '瑩': 271,
 '泉': 272,
 '向': 273,
 '邵': 274,
 '彩': 275,
 '薛': 276,
 '茂': 277,
 '冬': 278,
 '盛': 279,
 '保': 280,
 '兆': 281,
 '源': 282,
 '博': 283,
 '錢': 284,
 '達': 285,
 '妹': 286,
 '段': 287,
 '郝': 288,
 '南': 289,
 '開': 290,
 '如': 291,
 '權': 292,
 '仙': 293,
 '銘': 294,
 '洋': 295,
 '琪': 296,
 '賀': 297,
 '蓉': 298,
 '奇': 299,
 '芝': 300,
 '常': 301,
 '森': 302,
 '雙': 303,
 '道': 304,
 '龔': 305,
 '延': 306,
 '孔': 307,
 '倩': 308,
 '恩': 309,
 '恆': 310,
 '來': 311,
 '尚': 312,
 '嚴': 313,
 '媛': 314,
 '虎': 315,
 '其': 316,
 '巧': 317,
 '嬌': 318,
 '豪': 319,
 '炳': 320,
 '施': 321,
 '容': 322,
 '湯': 323,
 '陶': 324,
 '磊': 325,
 '賴': 326,
 '齊': 327,
 '茹': 328,
 '毅': 329,
 '俞': 330,
 '躍': 331,
 '溫': 332,
 '川': 333,
 '柳': 334,
 '佩': 335,
 '凌': 336,
 '翔': 337,
 '運': 338,
 '晨': 339,
 '葛': 340,
 '閆': 341,
 '禮': 342,
 '韋': 343,
 '承': 344,
 '冰': 345,
 '敬': 346,
 '妮': 347,
 '聖': 348,
 '力': 349,
 '棟': 350,
 '孝': 351,
 '哲': 352,
 '日': 353,
 '紀': 354,
 '珊': 355,
 '豐': 356,
 '應': 357,
 '楠': 358,
 '珠': 359,
 '代': 360,
 '增': 361,
 '威': 362,
 '莊': 363,
 '旺': 364,
 '傅': 365,
 '仲': 366,
 '牛': 367,
 '顏': 368,
 '科': 369,
 '芹': 370,
 '碧': 371,
 '晶': 372,
 '詩': 373,
 '倪': 374,
 '益': 375,
 '風': 376,
 '善': 377,
 '樊': 378,
 '路': 379,
 '菲': 380,
 '魯': 381,
 '業': 382,
 '娥': 383,
 '嶽': 384,
 '三': 385,
 '懷': 386,
 '勤': 387,
 '定': 388,
 '佔': 389,
 '煥': 390,
 '易': 391,
 '廷': 392,
 '喬': 393,
 '莫': 394,
 '苗': 395,
 '柏': 396,
 '瑤': 397,
 '凡': 398,
 '治': 399,
 '邢': 400,
 '本': 401,
 '壽': 402,
 '琦': 403,
 '希': 404,
 '心': 405,
 '錫': 406,
 '信': 407,
 '奎': 408,
 '守': 409,
 '爲': 410,
 '舒': 411,
 '政': 412,
 '婉': 413,
 '軒': 414,
 '加': 415,
 '顯': 416,
 '然': 417,
 '關': 418,
 '仕': 419,
 '虹': 420,
 '祝': 421,
 '伯': 422,
 '貞': 423,
 '申': 424,
 '潤': 425,
 '揚': 426,
 '倫': 427,
 '之': 428,
 '太': 429,
 '薇': 430,
 '若': 431,
 '鳴': 432,
 '阮': 433,
 '靈': 434,
 '鐵': 435,
 '聰': 436,
 '真': 437,
 '聶': 438,
 '璐': 439,
 '洲': 440,
 '伍': 441,
 '藝': 442,
 '歐': 443,
 '同': 444,
 '童': 445,
 '翟': 446,
 '男': 447,
 '露': 448,
 '卓': 449,
 '殷': 450,
 '西': 451,
 '龐': 452,
 '誠': 453,
 '可': 454,
 '憲': 455,
 '升': 456,
 '崇': 457,
 '兒': 458,
 '堂': 459,
 '季': 460,
 '柯': 461,
 '妍': 462,
 '育': 463,
 '園': 464,
 '卿': 465,
 '耿': 466,
 '蕾': 467,
 '焦': 468,
 '年': 469,
 '欽': 470,
 '儀': 471,
 '柱': 472,
 '堅': 473,
 '朋': 474,
 '楚': 475,
 '財': 476,
 '基': 477,
 '燦': 478,
 '曼': 479,
 '婧': 480,
 '臣': 481,
 '巖': 482,
 '翁': 483,
 '相': 484,
 '單': 485,
 '城': 486,
 '左': 487,
 '京': 488,
 '霖': 489,
 '娣': 490,
 '久': 491,
 '晴': 492,
 '芸': 493,
 '笑': 494,
 '昭': 495,
 '彪': 496,
 '標': 497,
 '存': 498,
 '藍': 499,
 '修': 500,
 '木': 501,
 '包': 502,
 '時': 503,
 '莎': 504,
 '彤': 505,
 '涵': 506,
 '裕': 507,
 '法': 508,
 '帥': 509,
 '湘': 510,
 '作': 511,
 '歡': 512,
 '畢': 513,
 '甘': 514,
 '二': 515,
 '自': 516,
 '瑜': 517,
 '曲': 518,
 '勳': 519,
 '登': 520,
 '邦': 521,
 '瑋': 522,
 '炎': 523,
 '煒': 524,
 '紫': 525,
 '艾': 526,
 '悅': 527,
 '依': 528,
 '逸': 529,
 '航': 530,
 '庭': 531,
 '沙': 532,
 '宜': 533,
 '鈺': 534,
 '冠': 535,
 '霍': 536,
 '昊': 537,
 '泰': 538,
 '河': 539,
 '滿': 540,
 '裴': 541,
 '馨': 542,
 '鮑': 543,
 '均': 544,
 '塗': 545,
 '微': 546,
 '辰': 547,
 '亭': 548,
 '谷': 549,
 '詹': 550,
 '竹': 551,
 '麟': 552,
 '辛': 553,
 '斯': 554,
 '圓': 555,
 '奕': 556,
 '茜': 557,
 '濱': 558,
 '純': 559,
 '賓': 560,
 '騰': 561,
 '從': 562,
 '汝': 563,
 '覃': 564,
 '阿': 565,
 '堯': 566,
 '晉': 567,
 '行': 568,
 '細': 569,
 '鈞': 570,
 '饒': 571,
 '駱': 572,
 '厚': 573,
 '迎': 574,
 '緒': 575,
 '夫': 576,
 '迪': 577,
 '賽': 578,
 '祿': 579,
 '柴': 580,
 '橋': 581,
 '姣': 582,
 '姍': 583,
 '蓓': 584,
 '照': 585,
 '貝': 586,
 '通': 587,
 '婭': 588,
 '人': 589,
 '能': 590,
 '枝': 591,
 '管': 592,
 '菁': 593,
 '震': 594,
 '冉': 595,
 '靖': 596,
 '盈': 597,
 '伊': 598,
 '暉': 599,
 '公': 600,
 '初': 601,
 '隆': 602,
 '乃': 603,
 '萌': 604,
 '鎮': 605,
 '雁': 606,
 '熙': 607,
 '功': 608,
 '司': 609,
 '環': 610,
 '遊': 611,
 '璇': 612,
 '觀': 613,
 '俠': 614,
 '乾': 615,
 '令': 616,
 '桃': 617,
 '以': 618,
 '梓': 619,
 '尤': 620,
 '云': 621,
 '寬': 622,
 '甜': 623,
 '原': 624,
 '再': 625,
 '靳': 626,
 '聲': 627,
 '銳': 628,
 '獻': 629,
 '祁': 630,
 '瀟': 631,
 '駿': 632,
 '妙': 633,
 '瑛': 634,
 '前': 635,
 '亦': 636,
 '得': 637,
 '盼': 638,
 '古': 639,
 '鄔': 640,
 '昕': 641,
 '映': 642,
 '印': 643,
 '房': 644,
 '秉': 645,
 '筱': 646,
 '儒': 647,
 '樓': 648,
 '解': 649,
 '滕': 650,
 '四': 651,
 '喻': 652,
 '竇': 653,
 '佑': 654,
 '符': 655,
 '叢': 656,
 '經': 657,
 '殿': 658,
 '鶴': 659,
 '項': 660,
 '屈': 661,
 '媚': 662,
 '鋼': 663,
 '謙': 664,
 '睿': 665,
 '羽': 666,
 '舉': 667,
 '九': 668,
 '蕊': 669,
 '蕭': 670,
 '爾': 671,
 '繆': 672,
 '宮': 673,
 '嫺': 674,
 '穆': 675,
 '理': 676,
 '意': 677,
 '巍': 678,
 '名': 679,
 '池': 680,
 '昆': 681,
 '火': 682,
 '芮': 683,
 '姬': 684,
 '沛': 685,
 '韻': 686,
 '閻': 687,
 '嵐': 688,
 '蔚': 689,
 '影': 690,
 '戚': 691,
 '冷': 692,
 '虞': 693,
 '費': 694,
 '杏': 695,
 '嶺': 696,
 '百': 697,
 '車': 698,
 '卜': 699,
 '宣': 700,
 '瑾': 701,
 '鬱': 702,
 '寒': 703,
 '言': 704,
 '起': 705,
 '斐': 706,
 '帆': 707,
 '化': 708,
 '官': 709,
 '褚': 710,
 '重': 711,
 '嬋': 712,
 '戎': 713,
 '展': 714,
 '煌': 715,
 '甫': 716,
 '毓': 717,
 '添': 718,
 '米': 719,
 '師': 720,
 '婁': 721,
 '釗': 722,
 '桑': 723,
 '芷': 724,
 '濟': 725,
 '禹': 726,
 '婕': 727,
 '牟': 728,
 '千': 729,
 '淵': 730,
 '濛': 731,
 '營': 732,
 '蒙': 733,
 '翰': 734,
 '魁': 735,
 '蒲': 736,
 '勁': 737,
 '煜': 738,
 '越': 739,
 '合': 740,
 '述': 741,
 '召': 742,
 '念': 743,
 '皓': 744,
 '姝': 745,
 '戰': 746,
 '舟': 747,
 '壯': 748,
 '杰': 749,
 '仇': 750,
 '允': 751,
 '樸': 752,
 '黨': 753,
 '樑': 754,
 '苑': 755,
 '詠': 756,
 '普': 757,
 '農': 758,
 '霜': 759,
 '萱': 760,
 '欒': 761,
 '麥': 762,
 '鈴': 763,
 '臧': 764,
 '潮': 765,
 '北': 766,
 '必': 767,
 '茵': 768,
 '卞': 769,
 '井': 770,
 '徵': 771,
 '閔': 772,
 '隋': 773,
 '席': 774,
 '聞': 775,
 '品': 776,
 '寅': 777,
 '邊': 778,
 '桐': 779,
 '果': 780,
 '丙': 781,
 '綺': 782,
 '祺': 783,
 '週': 784,
 '致': 785,
 '伶': 786,
 '佟': 787,
 '昱': 788,
 '商': 789,
 '望': 790,
 '幼': 791,
 '聯': 792,
 '晏': 793,
 '幫': 794,
 '俐': 795,
 '曙': 796,
 '齡': 797,
 '庚': 798,
 '刁': 799,
 '昇': 800,
 '慕': 801,
 '效': 802,
 '衍': 803,
 '烈': 804,
 '現': 805,
 '好': 806,
 '弘': 807,
 '鼎': 808,
 '弟': 809,
 '瞿': 810,
 '繁': 811,
 '曦': 812,
 '查': 813,
 '儲': 814,
 '寇': 815,
 '蘋': 816,
 '攀': 817,
 '髮': 818,
 '郎': 819,
 '在': 820,
 '選': 821,
 '旗': 822,
 '簡': 823,
 '里': 824,
 '于': 825,
 '珂': 826,
 '韶': 827,
 '衡': 828,
 '楓': 829,
 '球': 830,
 '琛': 831,
 '鬆': 832,
 '改': 833,
 '端': 834,
 '州': 835,
 '步': 836,
 '猛': 837,
 '岑': 838,
 '遲': 839,
 '談': 840,
 '嘯': 841,
 '佐': 842,
 '霄': 843,
 '捷': 844,
 '軼': 845,
 '璽': 846,
 '革': 847,
 '留': 848,
 '含': 849,
 '匡': 850,
 '荷': 851,
 '韜': 852,
 '碩': 853,
 '嫣': 854,
 '封': 855,
 '杭': 856,
 '姿': 857,
 '巨': 858,
 '幸': 859,
 '榕': 860,
 '屠': 861,
 '訓': 862,
 '知': 863,
 '上': 864,
 '卉': 865,
 '冀': 866,
 '燁': 867,
 '峻': 868,
 '屏': 869,
 '澄': 870,
 '復': 871,
 '甲': 872,
 '麻': 873,
 '鞠': 874,
 '瀚': 875,
 '鎖': 876,
 '余': 877,
 '幹': 878,
 '深': 879,
 '崗': 880,
 '棠': 881,
 '錄': 882,
 '璋': 883,
 '非': 884,
 '鞏': 885,
 '懿': 886,
 '村': 887,
 '招': 888,
 '禎': 889,
 '甄': 890,
 '臘': 891,
 '裘': 892,
 '淼': 893,
 '巫': 894,
 '奚': 895,
 '語': 896,
 '廉': 897,
 '儉': 898,
 '逢': 899,
 '實': 900,
 '絲': 901,
 '見': 902,
 '漫': 903,
 '敦': 904,
 '翼': 905,
 '溪': 906,
 '荊': 907,
 '蘆': 908,
 '典': 909,
 '玥': 910,
 '鏡': 911,
 '慈': 912,
 '則': 913,
 '舜': 914,
 '衝': 915,
 '居': 916,
 '赫': 917,
 '諸': 918,
 '晗': 919,
 '泳': 920,
 '挺': 921,
 '社': 922,
 '泓': 923,
 '炯': 924,
 '計': 925,
 '浪': 926,
 '綱': 927,
 '習': 928,
 '鶯': 929,
 '麒': 930,
 '坡': 931,
 '予': 932,
 '鐸': 933,
 '多': 934,
 '爭': 935,
 '霏': 936,
 '臻': 937,
 '佘': 938,
 '鐘': 939,
 '音': 940,
 '格': 941,
 '津': 942,
 '玫': 943,
 '湖': 944,
 '土': 945,
 '唯': 946,
 '蘊': 947,
 '沁': 948,
 '五': 949,
 '戈': 950,
 '焱': 951,
 '誌': 952,
 '樺': 953,
 '靚': 954,
 '鄺': 955,
 '肇': 956,
 '積': 957,
 '謀': 958,
 '索': 959,
 '貽': 960,
 '楷': 961,
 '贊': 962,
 '歌': 963,
 '征': 964,
 '記': 965,
 '際': 966,
 '領': 967,
 '閣': 968,
 '圖': 969,
 '皮': 970,
 '勞': 971,
 '胥': 972,
 '敖': 973,
 '瀅': 974,
 '渝': 975,
 '植': 976,
 '櫻': 977,
 '藺': 978,
 '昂': 979,
 '叔': 980,
 '恬': 981,
 '憶': 982,
 '野': 983,
 '亨': 984,
 '苟': 985,
 '都': 986,
 '愷': 987,
 '妤': 988,
 '淳': 989,
 '爽': 990,
 '倉': 991,
 '團': 992,
 '筠': 993,
 '穩': 994,
 '流': 995,
 '綵': 996,
 '競': 997,
 '茅': 998,
 '楨': 999,
 '豆': 1000,
 ...}

Padding

  • When padding all texts to a uniform length, consider whether to pad or truncate at the beginning of each sequence (pre) or at the end (post).

  • Check the padding and truncating parameters in pad_sequences.
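A minimal pure-Python sketch of the padding/truncating behavior (a hypothetical `pad` helper mimicking `pad_sequences` semantics, not the Keras code):

```python
def pad(seqs, maxlen, padding="pre", truncating="pre", value=0):
    out = []
    for s in seqs:
        if len(s) > maxlen:
            # drop tokens from the front ('pre') or the back ('post')
            s = s[-maxlen:] if truncating == "pre" else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        # prepend the padding value ('pre') or append it ('post')
        out.append(fill + s if padding == "pre" else s + fill)
    return out

seqs = [[5], [7, 95, 749, 12]]
print(pad(seqs, maxlen=3))                                     # [[0, 0, 5], [95, 749, 12]]
print(pad(seqs, maxlen=3, padding="post", truncating="post"))  # [[5, 0, 0], [7, 95, 749]]
```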

names_lens=[len(n) for n in names_ints]
names_lens
import seaborn as sns
sns.displot(names_lens)
print(names[np.argmax(names_lens)]) # longest name
李照華
../_images/8-dl-chinese-name-gender_29_1.png
max_len = names_lens[np.argmax(names_lens)]
max_len
3
names_ints_pad = sequence.pad_sequences(names_ints, maxlen = max_len)
names_ints_pad[:10]
array([[  2, 585,  10],
       [ 78, 250,  48],
       [918, 340,  18],
       [  7,  95, 749],
       [197, 228, 228],
       [ 73, 641, 641],
       [330, 242, 458],
       [327,  28,  48],
       [  7, 525, 542],
       [ 63,  18,  50]], dtype=int32)

Define X and Y

X_train = np.array(names_ints_pad).astype('int32')
y_train = np.array(labels)

X_test = np.array(sequence.pad_sequences(
    tokenizer.texts_to_sequences([n for (n,l) in test_set]),
    maxlen = max_len)).astype('int32')
y_test = np.array([l for (n,l) in test_set])

X_test_texts = [n for (n,l) in test_set]
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(732516, 3)
(732516,)
(183129, 3)
(183129,)

Method 2: Text to Matrix

One-Hot Encoding

  • Text to Matrix (creates a bag-of-words representation of each text)

  • Choose one of the modes: binary, count, or tfidf

names_matrix = tokenizer.texts_to_matrix(names, mode="binary")
names[2]
'諸葛偉'
  • names_matrix is in fact a bag-of-characters representation of each name.

import pandas as pd
pd.DataFrame(names_matrix[2,1:], 
             columns=["ONE-HOT"],
             index=list(tokenizer.word_index.keys()))
ONE-HOT
0.0
0.0
0.0
0.0
0.0
... ...
0.0
0.0
0.0
0.0
0.0

2240 rows × 1 columns
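The difference between the binary and count modes can be illustrated with a toy char-level vocabulary (`to_matrix` is a hypothetical helper, not the Keras implementation; the tfidf mode additionally weights counts by inverse document frequency). Column 0 stays unused, matching the reserved padding index:

```python
def to_matrix(texts, word_index, mode="binary"):
    mat = []
    for t in texts:
        row = [0.0] * (len(word_index) + 1)  # column 0 is the unused padding slot
        for ch in t:
            row[word_index[ch]] += 1.0       # 'count' mode: raw frequencies
        if mode == "binary":
            row = [1.0 if v else 0.0 for v in row]
        mat.append(row)
    return mat

word_index = {"偉": 1, "諸": 2, "葛": 3}
print(to_matrix(["諸葛偉"], word_index, mode="binary"))  # [[0.0, 1.0, 1.0, 1.0]]
print(to_matrix(["葛葛"], word_index, mode="count"))     # [[0.0, 0.0, 0.0, 2.0]]
```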

Define X and Y

X_train2 = np.array(names_matrix).astype('int32')
y_train2 = np.array(labels)

X_test2 = tokenizer.texts_to_matrix([n for (n,l) in test_set], mode="binary").astype('int32')
y_test2 = np.array([l for (n,l) in test_set])

X_test2_texts = [n for (n,l) in test_set]
print(X_train2.shape)
print(y_train2.shape)
print(X_test2.shape)
print(y_test2.shape)
(732516, 2241)
(732516,)
(183129, 2241)
(183129,)

Model Definition

  • After we have defined our input and output tensors (X and y), we can define the architecture of our neural network model.

  • For the two vectorized representations of the names, we try two different network architectures.

    • Text to Sequences: Embedding + RNN

    • Text to Matrix: Fully connected Dense Layers

import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
# Plotting results
def plot1(history):

    matplotlib.rcParams['figure.dpi'] = 100
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(1, len(acc)+1)
    ## Accuracy plot
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    ## Loss plot
    plt.figure()

    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()

    
def plot2(history):
    pd.DataFrame(history.history).plot(figsize=(8,5))
    plt.grid(True)
    #plt.gca().set_ylim(0,1)
    plt.show()

Model 1: Fully Connected Dense Layers

  • Two fully-connected dense layers with the Text-to-Matrix inputs

from keras import layers
model1 = keras.Sequential()
model1.add(keras.Input(shape=(vocab_size,), name="one_hot_input"))
model1.add(layers.Dense(16, activation="relu", name="dense_layer_1"))
model1.add(layers.Dense(16, activation="relu", name="dense_layer_2"))
model1.add(layers.Dense(1, activation="sigmoid", name="output"))

model1.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"]
)
plot_model(model1, show_shapes=True)
../_images/8-dl-chinese-name-gender_51_0.png

A few hyperparameters for network training

  • Batch size

  • Epoch

  • Validation Split Ratio

BATCH_SIZE=512
EPOCHS=20
VALIDATION_SPLIT=0.2
history1 = model1.fit(X_train2, y_train2, 
                    batch_size=BATCH_SIZE, 
                    epochs=EPOCHS, verbose=2,
                   validation_split = VALIDATION_SPLIT)
Epoch 1/20
4579/4579 - 12s - loss: 0.0727 - accuracy: 0.9718 - val_loss: 0.0508 - val_accuracy: 0.9806
Epoch 2/20
4579/4579 - 8s - loss: 0.0447 - accuracy: 0.9830 - val_loss: 0.0416 - val_accuracy: 0.9844
Epoch 3/20
4579/4579 - 8s - loss: 0.0376 - accuracy: 0.9854 - val_loss: 0.0380 - val_accuracy: 0.9855
Epoch 4/20
4579/4579 - 8s - loss: 0.0338 - accuracy: 0.9868 - val_loss: 0.0381 - val_accuracy: 0.9854
Epoch 5/20
4579/4579 - 8s - loss: 0.0310 - accuracy: 0.9876 - val_loss: 0.0363 - val_accuracy: 0.9863
Epoch 6/20
4579/4579 - 8s - loss: 0.0287 - accuracy: 0.9885 - val_loss: 0.0356 - val_accuracy: 0.9867
Epoch 7/20
4579/4579 - 8s - loss: 0.0269 - accuracy: 0.9891 - val_loss: 0.0352 - val_accuracy: 0.9870
Epoch 8/20
4579/4579 - 8s - loss: 0.0256 - accuracy: 0.9896 - val_loss: 0.0342 - val_accuracy: 0.9875
Epoch 9/20
4579/4579 - 8s - loss: 0.0245 - accuracy: 0.9899 - val_loss: 0.0344 - val_accuracy: 0.9878
Epoch 10/20
4579/4579 - 8s - loss: 0.0235 - accuracy: 0.9902 - val_loss: 0.0345 - val_accuracy: 0.9878
Epoch 11/20
4579/4579 - 9s - loss: 0.0227 - accuracy: 0.9905 - val_loss: 0.0348 - val_accuracy: 0.9878
Epoch 12/20
4579/4579 - 9s - loss: 0.0220 - accuracy: 0.9908 - val_loss: 0.0354 - val_accuracy: 0.9875
Epoch 13/20
4579/4579 - 8s - loss: 0.0215 - accuracy: 0.9909 - val_loss: 0.0353 - val_accuracy: 0.9875
Epoch 14/20
4579/4579 - 8s - loss: 0.0211 - accuracy: 0.9911 - val_loss: 0.0361 - val_accuracy: 0.9874
Epoch 15/20
4579/4579 - 8s - loss: 0.0206 - accuracy: 0.9912 - val_loss: 0.0360 - val_accuracy: 0.9875
Epoch 16/20
4579/4579 - 9s - loss: 0.0202 - accuracy: 0.9912 - val_loss: 0.0365 - val_accuracy: 0.9875
Epoch 17/20
4579/4579 - 8s - loss: 0.0198 - accuracy: 0.9914 - val_loss: 0.0381 - val_accuracy: 0.9874
Epoch 18/20
4579/4579 - 8s - loss: 0.0196 - accuracy: 0.9914 - val_loss: 0.0377 - val_accuracy: 0.9874
Epoch 19/20
4579/4579 - 8s - loss: 0.0192 - accuracy: 0.9916 - val_loss: 0.0393 - val_accuracy: 0.9872
Epoch 20/20
4579/4579 - 9s - loss: 0.0191 - accuracy: 0.9916 - val_loss: 0.0380 - val_accuracy: 0.9873
plot2(history1)
../_images/8-dl-chinese-name-gender_55_0.png
model1.evaluate(X_test2, y_test2, batch_size=128, verbose=2)
1431/1431 - 2s - loss: 0.0364 - accuracy: 0.9876
[0.03635438531637192, 0.9876261949539185]

Model 2: Embedding + RNN

  • One Embedding Layer + One RNN Layer

  • With Text-to-Sequence inputs

EMBEDDING_DIM = 128
model2 = Sequential()
model2.add(Embedding(input_dim=vocab_size, 
                     output_dim=EMBEDDING_DIM, 
                     input_length=max_len, 
                     mask_zero=True))
model2.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer"))
model2.add(Dense(16, activation="relu", name="dense_layer"))
model2.add(Dense(1, activation="sigmoid", name="output"))

model2.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"]
)
plot_model(model2, show_shapes=True)
../_images/8-dl-chinese-name-gender_61_0.png
history2 = model2.fit(X_train, y_train, 
                    batch_size=BATCH_SIZE, 
                    epochs=EPOCHS, verbose=2,
                    validation_split = VALIDATION_SPLIT)
Epoch 1/20
4579/4579 - 16s - loss: 0.0189 - accuracy: 0.9929 - val_loss: 0.0053 - val_accuracy: 0.9984
Epoch 2/20
4579/4579 - 14s - loss: 0.0027 - accuracy: 0.9992 - val_loss: 0.0029 - val_accuracy: 0.9992
Epoch 3/20
4579/4579 - 14s - loss: 0.0016 - accuracy: 0.9995 - val_loss: 0.0028 - val_accuracy: 0.9992
Epoch 4/20
4579/4579 - 14s - loss: 0.0010 - accuracy: 0.9997 - val_loss: 0.0033 - val_accuracy: 0.9991
Epoch 5/20
4579/4579 - 14s - loss: 7.8076e-04 - accuracy: 0.9997 - val_loss: 0.0022 - val_accuracy: 0.9995
Epoch 6/20
4579/4579 - 14s - loss: 6.2756e-04 - accuracy: 0.9998 - val_loss: 0.0024 - val_accuracy: 0.9995
Epoch 7/20
4579/4579 - 14s - loss: 4.5845e-04 - accuracy: 0.9999 - val_loss: 0.0027 - val_accuracy: 0.9994
Epoch 8/20
4579/4579 - 14s - loss: 3.9266e-04 - accuracy: 0.9999 - val_loss: 0.0025 - val_accuracy: 0.9995
Epoch 9/20
4579/4579 - 15s - loss: 2.9708e-04 - accuracy: 0.9999 - val_loss: 0.0025 - val_accuracy: 0.9995
Epoch 10/20
4579/4579 - 15s - loss: 2.7430e-04 - accuracy: 0.9999 - val_loss: 0.0030 - val_accuracy: 0.9994
Epoch 11/20
4579/4579 - 15s - loss: 2.2277e-04 - accuracy: 0.9999 - val_loss: 0.0025 - val_accuracy: 0.9996
Epoch 12/20
4579/4579 - 14s - loss: 2.2406e-04 - accuracy: 0.9999 - val_loss: 0.0028 - val_accuracy: 0.9995
Epoch 13/20
4579/4579 - 14s - loss: 1.8844e-04 - accuracy: 1.0000 - val_loss: 0.0029 - val_accuracy: 0.9995
Epoch 14/20
4579/4579 - 16s - loss: 1.4918e-04 - accuracy: 0.9999 - val_loss: 0.0028 - val_accuracy: 0.9996
Epoch 15/20
4579/4579 - 14s - loss: 1.3319e-04 - accuracy: 1.0000 - val_loss: 0.0033 - val_accuracy: 0.9995
Epoch 16/20
4579/4579 - 16s - loss: 9.9342e-05 - accuracy: 1.0000 - val_loss: 0.0033 - val_accuracy: 0.9996
Epoch 17/20
4579/4579 - 15s - loss: 1.6295e-04 - accuracy: 0.9999 - val_loss: 0.0030 - val_accuracy: 0.9996
Epoch 18/20
4579/4579 - 14s - loss: 9.1653e-05 - accuracy: 1.0000 - val_loss: 0.0029 - val_accuracy: 0.9996
Epoch 19/20
4579/4579 - 14s - loss: 7.0746e-05 - accuracy: 1.0000 - val_loss: 0.0032 - val_accuracy: 0.9995
Epoch 20/20
4579/4579 - 15s - loss: 8.1217e-05 - accuracy: 1.0000 - val_loss: 0.0035 - val_accuracy: 0.9995
plot2(history2)
../_images/8-dl-chinese-name-gender_63_0.png
model2.evaluate(X_test, y_test, batch_size=128, verbose=2)
1431/1431 - 1s - loss: 0.0040 - accuracy: 0.9996
[0.004011294338852167, 0.9995522499084473]

Model 3: Regularization and Dropout

  • The previous two examples clearly show overfitting: performance on the validation set stalls (and the validation loss starts to rise) after the first few epochs, while training performance keeps improving.

  • We can add regularization and dropout to our network definition to mitigate overfitting.
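Besides dropout (used below), Keras also supports weight regularization, e.g. passing `kernel_regularizer=keras.regularizers.l2(0.01)` to a layer. The penalty this adds to the loss is simply lambda times the sum of squared weights, as this plain-Python sketch shows (`l2_penalty` is a hypothetical helper for illustration):

```python
# L2 regularization adds lam * sum(w^2) to the training loss, discouraging
# large weights; this is the quantity kernel_regularizer=l2(lam) contributes.
def l2_penalty(weights, lam=0.01):
    return lam * sum(w * w for row in weights for w in row)

weights = [[0.5, -0.5], [1.0, 0.0]]
print(l2_penalty(weights, lam=0.01))  # = 0.01 * (0.25 + 0.25 + 1.0 + 0.0) ≈ 0.015
```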

EMBEDDING_DIM = 128
model3 = Sequential()
model3.add(Embedding(input_dim=vocab_size, 
                     output_dim=EMBEDDING_DIM, 
                     input_length=max_len, 
                     mask_zero=True))
model3.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer", dropout=0.2, recurrent_dropout=0.2))
model3.add(Dense(16, activation="relu", name="dense_layer"))
model3.add(Dense(1, activation="sigmoid", name="output"))

model3.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"]
)
plot_model(model3)
../_images/8-dl-chinese-name-gender_68_0.png
history3 = model3.fit(X_train, y_train, 
                    batch_size=BATCH_SIZE, 
                    epochs=EPOCHS, verbose=2,
                    validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 2s - loss: 0.6195 - accuracy: 0.6271 - val_loss: 0.5489 - val_accuracy: 0.6412
Epoch 2/20
40/40 - 0s - loss: 0.5249 - accuracy: 0.6642 - val_loss: 0.4955 - val_accuracy: 0.7671
Epoch 3/20
40/40 - 0s - loss: 0.4845 - accuracy: 0.7661 - val_loss: 0.4693 - val_accuracy: 0.7703
Epoch 4/20
40/40 - 0s - loss: 0.4598 - accuracy: 0.7797 - val_loss: 0.4505 - val_accuracy: 0.7766
Epoch 5/20
40/40 - 0s - loss: 0.4432 - accuracy: 0.7807 - val_loss: 0.4464 - val_accuracy: 0.7718
Epoch 6/20
40/40 - 0s - loss: 0.4338 - accuracy: 0.7860 - val_loss: 0.4348 - val_accuracy: 0.7931
Epoch 7/20
40/40 - 0s - loss: 0.4261 - accuracy: 0.7907 - val_loss: 0.4310 - val_accuracy: 0.7931
Epoch 8/20
40/40 - 0s - loss: 0.4206 - accuracy: 0.7970 - val_loss: 0.4293 - val_accuracy: 0.7946
Epoch 9/20
40/40 - 0s - loss: 0.4223 - accuracy: 0.7972 - val_loss: 0.4262 - val_accuracy: 0.7923
Epoch 10/20
40/40 - 0s - loss: 0.4153 - accuracy: 0.8004 - val_loss: 0.4337 - val_accuracy: 0.7766
Epoch 11/20
40/40 - 0s - loss: 0.4158 - accuracy: 0.8031 - val_loss: 0.4331 - val_accuracy: 0.7844
Epoch 12/20
40/40 - 0s - loss: 0.4173 - accuracy: 0.7943 - val_loss: 0.4289 - val_accuracy: 0.7923
Epoch 13/20
40/40 - 0s - loss: 0.4122 - accuracy: 0.8068 - val_loss: 0.4245 - val_accuracy: 0.7939
Epoch 14/20
40/40 - 0s - loss: 0.4089 - accuracy: 0.8059 - val_loss: 0.4259 - val_accuracy: 0.7939
Epoch 15/20
40/40 - 0s - loss: 0.4143 - accuracy: 0.8072 - val_loss: 0.4298 - val_accuracy: 0.7891
Epoch 16/20
40/40 - 0s - loss: 0.4114 - accuracy: 0.8041 - val_loss: 0.4241 - val_accuracy: 0.8017
Epoch 17/20
40/40 - 0s - loss: 0.4062 - accuracy: 0.8076 - val_loss: 0.4227 - val_accuracy: 0.7931
Epoch 18/20
40/40 - 0s - loss: 0.4052 - accuracy: 0.8110 - val_loss: 0.4278 - val_accuracy: 0.7939
Epoch 19/20
40/40 - 0s - loss: 0.4100 - accuracy: 0.8068 - val_loss: 0.4218 - val_accuracy: 0.7954
Epoch 20/20
40/40 - 0s - loss: 0.4091 - accuracy: 0.8072 - val_loss: 0.4208 - val_accuracy: 0.7939
plot2(history3)
../_images/8-dl-chinese-name-gender_70_0.png
model3.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.4074 - accuracy: 0.8074
[0.4073854982852936, 0.8074260354042053]

Model 4: Improve the Models

  • In addition to regularization and dropout, we can further improve the model by increasing its capacity.

  • In particular, we can increase the depth and width of the network layers.

  • Let’s try stacking two RNN layers.

EMBEDDING_DIM = 128
model4 = Sequential()
model4.add(Embedding(input_dim=vocab_size, 
                     output_dim=EMBEDDING_DIM, 
                     input_length=max_len, 
                     mask_zero=True))
model4.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer_1", 
                            dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model4.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer_2", 
                            dropout=0.2, recurrent_dropout=0.2))
model4.add(Dense(1, activation="sigmoid", name="output"))
model4.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"]
)
plot_model(model4)
../_images/8-dl-chinese-name-gender_75_0.png
history4 = model4.fit(X_train, y_train, 
                    batch_size=BATCH_SIZE, 
                    epochs=EPOCHS, verbose=2,
                    validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 3s - loss: 0.6870 - accuracy: 0.6170 - val_loss: 0.6689 - val_accuracy: 0.7364
Epoch 2/20
40/40 - 0s - loss: 0.6519 - accuracy: 0.7341 - val_loss: 0.6269 - val_accuracy: 0.7750
Epoch 3/20
40/40 - 0s - loss: 0.5970 - accuracy: 0.7618 - val_loss: 0.5502 - val_accuracy: 0.7734
Epoch 4/20
40/40 - 0s - loss: 0.5222 - accuracy: 0.7689 - val_loss: 0.4882 - val_accuracy: 0.7758
Epoch 5/20
40/40 - 0s - loss: 0.4833 - accuracy: 0.7750 - val_loss: 0.4643 - val_accuracy: 0.7789
Epoch 6/20
40/40 - 0s - loss: 0.4689 - accuracy: 0.7815 - val_loss: 0.4557 - val_accuracy: 0.7836
Epoch 7/20
40/40 - 0s - loss: 0.4560 - accuracy: 0.7840 - val_loss: 0.4544 - val_accuracy: 0.7828
Epoch 8/20
40/40 - 0s - loss: 0.4532 - accuracy: 0.7870 - val_loss: 0.4545 - val_accuracy: 0.7836
Epoch 9/20
40/40 - 0s - loss: 0.4447 - accuracy: 0.7937 - val_loss: 0.4496 - val_accuracy: 0.7876
Epoch 10/20
40/40 - 0s - loss: 0.4490 - accuracy: 0.7889 - val_loss: 0.4492 - val_accuracy: 0.7844
Epoch 11/20
40/40 - 0s - loss: 0.4440 - accuracy: 0.7889 - val_loss: 0.4468 - val_accuracy: 0.7868
Epoch 12/20
40/40 - 0s - loss: 0.4301 - accuracy: 0.8019 - val_loss: 0.4464 - val_accuracy: 0.7852
Epoch 13/20
40/40 - 0s - loss: 0.4411 - accuracy: 0.7884 - val_loss: 0.4442 - val_accuracy: 0.7884
Epoch 14/20
40/40 - 0s - loss: 0.4314 - accuracy: 0.7992 - val_loss: 0.4424 - val_accuracy: 0.7876
Epoch 15/20
40/40 - 1s - loss: 0.4323 - accuracy: 0.7948 - val_loss: 0.4442 - val_accuracy: 0.7884
Epoch 16/20
40/40 - 0s - loss: 0.4318 - accuracy: 0.7962 - val_loss: 0.4434 - val_accuracy: 0.7907
Epoch 17/20
40/40 - 0s - loss: 0.4276 - accuracy: 0.7962 - val_loss: 0.4421 - val_accuracy: 0.7852
Epoch 18/20
40/40 - 0s - loss: 0.4328 - accuracy: 0.7941 - val_loss: 0.4408 - val_accuracy: 0.7907
Epoch 19/20
40/40 - 0s - loss: 0.4277 - accuracy: 0.7960 - val_loss: 0.4375 - val_accuracy: 0.7939
Epoch 20/20
40/40 - 0s - loss: 0.4343 - accuracy: 0.7905 - val_loss: 0.4461 - val_accuracy: 0.7844
plot2(history4)
../_images/8-dl-chinese-name-gender_77_0.png
model4.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.4397 - accuracy: 0.7885
[0.43966466188430786, 0.7885462641716003]

Model 5: Bidirectional

  • Now let’s try a more sophisticated recurrent layer, the LSTM, with bidirectional processing.

  • We also add more units to the LSTM layer.
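Why the Bidirectional wrapper doubles the layer’s output width: it runs one LSTM over the sequence left-to-right and another right-to-left, and by default concatenates their final states. A shape-only sketch with toy values (in the real model the two states come from the two LSTM passes):

```python
units = 32
forward_state = [0.0] * units            # final hidden state of the forward pass
backward_state = [0.0] * units           # final hidden state of the backward pass
merged = forward_state + backward_state  # default merge_mode='concat'
print(len(merged))                       # 64: the Dense layer sees 2 * units features
```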

EMBEDDING_DIM = 128
model5 = Sequential()
model5.add(Embedding(input_dim=vocab_size, 
                     output_dim=EMBEDDING_DIM, 
                     input_length=max_len, 
                     mask_zero=True))
model5.add(layers.Bidirectional(LSTM(32, activation="relu", name="lstm_layer", dropout=0.2, recurrent_dropout=0.2)))
model5.add(Dense(1, activation="sigmoid", name="output"))

model5.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"]
)
plot_model(model5)
../_images/8-dl-chinese-name-gender_82_0.png
history5 = model5.fit(X_train, y_train, 
                    batch_size=BATCH_SIZE, 
                    epochs=EPOCHS, verbose=2,
                    validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 7s - loss: 0.6414 - accuracy: 0.6271 - val_loss: 0.5996 - val_accuracy: 0.6452
Epoch 2/20
40/40 - 1s - loss: 0.5327 - accuracy: 0.7140 - val_loss: 0.4618 - val_accuracy: 0.7718
Epoch 3/20
40/40 - 1s - loss: 0.4513 - accuracy: 0.7854 - val_loss: 0.4434 - val_accuracy: 0.7805
Epoch 4/20
40/40 - 1s - loss: 0.4318 - accuracy: 0.7941 - val_loss: 0.4355 - val_accuracy: 0.7954
Epoch 5/20
40/40 - 1s - loss: 0.4224 - accuracy: 0.7976 - val_loss: 0.4345 - val_accuracy: 0.8025
Epoch 6/20
40/40 - 1s - loss: 0.4148 - accuracy: 0.8004 - val_loss: 0.4271 - val_accuracy: 0.7876
Epoch 7/20
40/40 - 1s - loss: 0.4087 - accuracy: 0.8084 - val_loss: 0.4253 - val_accuracy: 0.7891
Epoch 8/20
40/40 - 1s - loss: 0.4065 - accuracy: 0.8106 - val_loss: 0.4222 - val_accuracy: 0.7891
Epoch 9/20
40/40 - 1s - loss: 0.4027 - accuracy: 0.8070 - val_loss: 0.4220 - val_accuracy: 0.7884
Epoch 10/20
40/40 - 1s - loss: 0.4012 - accuracy: 0.8118 - val_loss: 0.4195 - val_accuracy: 0.7939
Epoch 11/20
40/40 - 1s - loss: 0.3954 - accuracy: 0.8125 - val_loss: 0.4188 - val_accuracy: 0.7931
Epoch 12/20
40/40 - 1s - loss: 0.3961 - accuracy: 0.8131 - val_loss: 0.4154 - val_accuracy: 0.7962
Epoch 13/20
40/40 - 1s - loss: 0.3924 - accuracy: 0.8151 - val_loss: 0.4139 - val_accuracy: 0.8009
Epoch 14/20
40/40 - 1s - loss: 0.3902 - accuracy: 0.8185 - val_loss: 0.4127 - val_accuracy: 0.7954
Epoch 15/20
40/40 - 1s - loss: 0.3866 - accuracy: 0.8173 - val_loss: 0.4157 - val_accuracy: 0.7946
Epoch 16/20
40/40 - 1s - loss: 0.3861 - accuracy: 0.8190 - val_loss: 0.4129 - val_accuracy: 0.7970
Epoch 17/20
40/40 - 1s - loss: 0.3862 - accuracy: 0.8161 - val_loss: 0.4116 - val_accuracy: 0.7994
Epoch 18/20
40/40 - 1s - loss: 0.3825 - accuracy: 0.8200 - val_loss: 0.4128 - val_accuracy: 0.7970
Epoch 19/20
40/40 - 1s - loss: 0.3792 - accuracy: 0.8157 - val_loss: 0.4102 - val_accuracy: 0.7978
Epoch 20/20
40/40 - 1s - loss: 0.3768 - accuracy: 0.8212 - val_loss: 0.4073 - val_accuracy: 0.7970
plot2(history5)
../_images/8-dl-chinese-name-gender_84_0.png
model5.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.3951 - accuracy: 0.8125
[0.3950510621070862, 0.8124606609344482]

Check Embeddings

  • Compared to one-hot encodings of characters, embeddings may encode more information about the characteristics of the characters.

  • We can extract the embedding layer and apply dimensionality reduction techniques (e.g., t-SNE) to see how the embeddings capture the relationships between characters.

X_test[10]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 9, 9, 3, 2], dtype=int32)
ind2char = tokenizer.index_word
[ind2char.get(i) for i in X_test[10] if ind2char.get(i)!= None ]
['n', 'e', 's', 's', 'i', 'e']
tokenizer.texts_to_sequences('Alvin')
[[1], [6], [20], [3], [4]]
char_vectors = model5.layers[0].get_weights()[0]
char_vectors.shape
(29, 128)
labels = [char for (ind, char) in tokenizer.index_word.items()]
labels.insert(0,None)
labels
[None,
 'a',
 'e',
 'i',
 'n',
 'r',
 'l',
 'o',
 't',
 's',
 'd',
 'm',
 'y',
 'h',
 'c',
 'b',
 'u',
 'g',
 'k',
 'j',
 'v',
 'f',
 'p',
 'w',
 'z',
 'x',
 'q',
 '-',
 ' ']
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=2)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(char_vectors)

plt.figure(figsize=(10, 7), dpi=150)
plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')
../_images/8-dl-chinese-name-gender_93_0.png

Issues of Word/Character Representations

  • One-hot encoding does not capture any semantic relationships between characters.

  • For deep learning NLP, it is preferred to convert one-hot encodings of words/characters into embeddings, which are argued to encode more semantic information about the tokens.

  • Now the question is how to train and create better word embeddings. We will come back to this issue later.
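To see concretely why one-hot encodings carry no similarity information: every pair of distinct one-hot vectors is orthogonal (cosine similarity 0), whereas dense embedding vectors can have graded similarities, so related characters can end up closer than unrelated ones. A small sketch with hypothetical 4-dimensional vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot encodings: every distinct pair has cosine similarity 0.
onehot_a = np.array([1., 0., 0., 0.])
onehot_b = np.array([0., 1., 0., 0.])
print(cosine(onehot_a, onehot_b))  # 0.0

# Dense embeddings (hypothetical values): similarities are graded.
emb_a = np.array([0.9, 0.1, 0.3, 0.2])
emb_b = np.array([0.8, 0.2, 0.4, 0.1])
emb_c = np.array([-0.7, 0.9, -0.2, 0.5])
print(cosine(emb_a, emb_b) > cosine(emb_a, emb_c))  # True
```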

Hyperparameter Tuning

Note

Please install the keras tuner module in your current conda environment:

pip install -U keras-tuner
  • Like feature-based ML methods, neural networks also come with many hyperparameters, whose values need to be set before training rather than learned from the data.

  • Typical hyperparameters include:

    • Number of nodes in a layer

    • Learning rate

  • We can utilize the keras-tuner module to tune these hyperparameters.

  • Steps for Keras Tuner

    • First, wrap the model definition in a function, which takes a single hp argument.

    • Inside this function, replace any value we want to tune with a call to a hyperparameter sampling method, e.g., hp.Int() or hp.Choice(). The function should return a compiled model.

    • Next, instantiate a tuner object, specifying the optimization objective and other search parameters.

    • Finally, start the search with the search() method, which takes the same arguments as Model.fit() in Keras.

    • When the search is over, we can retrieve the best model and a summary of the results from the tuner.
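Under the hood, random search simply samples hyperparameter combinations from the declared space, evaluates each one, and keeps the best. A framework-free sketch of that loop, with a hypothetical toy objective standing in for actually fitting and evaluating a model:

```python
import random

def toy_objective(units, learning_rate):
    """Hypothetical validation accuracy as a function of two hyperparameters."""
    return 0.8 - abs(units - 48) / 200 - abs(learning_rate - 1e-3) * 10

# Declared search space, mirroring the hp.Int()/hp.Choice() declarations.
search_space = {
    "units": [16, 32, 48, 64],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

rng = random.Random(0)
best_score, best_hp = float("-inf"), None
for _ in range(10):  # max_trials
    # Sample one value per hyperparameter (one "trial").
    hp = {name: rng.choice(values) for name, values in search_space.items()}
    score = toy_objective(**hp)  # stands in for training + validation
    if score > best_score:
        best_score, best_hp = score, hp

print(best_hp, round(best_score, 4))
```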

import kerastuner
## Wrap model definition in a function
## and specify the parameters needed for tuning
# def build_model(hp):
#     model1 = keras.Sequential()
#     model1.add(keras.Input(shape=(max_len,)))
#     model1.add(layers.Dense(hp.Int('units_1', min_value=32, max_value=128, step=32), activation="relu", name="dense_layer_1"))
#     model1.add(layers.Dense(hp.Int('units_2', min_value=32, max_value=128, step=32), activation="relu", name="dense_layer_2"))
#     model1.add(layers.Dense(2, activation="softmax", name="output"))
#     model1.compile(
#         optimizer=keras.optimizers.Adam(
#             hp.Choice('learning_rate',
#                       values=[1e-2, 1e-3, 1e-4])),
#         loss='sparse_categorical_crossentropy',
#         metrics=['accuracy'])
#     return model1

def build_model(hp):
    m= Sequential()
    m.add(Embedding(input_dim=vocab_size, 
                    output_dim=hp.Int('output_dim', min_value=32, max_value=128, step=32), 
                    input_length=max_len, 
                    mask_zero=True))
    m.add(layers.Bidirectional(LSTM(
        hp.Int('units', min_value=16, max_value=64, step=16),
        activation="relu", 
        dropout=0.2, 
        recurrent_dropout=0.2)))
    m.add(Dense(2, activation="softmax", name="output"))

    m.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(),
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        metrics=["accuracy"]
    )
    return m
## This is to clean up the temp dir from the tuner
## Every time we re-start the tuner, it's better to keep the temp dir clean

import os
import shutil

if os.path.isdir('my_dir'):
    shutil.rmtree('my_dir')
    
  • The max_trials argument specifies the number of hyperparameter combinations that will be tested by the tuner.

  • The executions_per_trial argument is the number of models that should be built and fit for each trial, for robustness purposes.
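With these settings the tuner therefore fits max_trials × executions_per_trial models in total, and reports each trial's score as the average objective across its executions. A quick arithmetic sketch (the scores below are hypothetical):

```python
max_trials = 10           # hyperparameter combinations to try
executions_per_trial = 2  # repeated fits per combination, for robustness
total_fits = max_trials * executions_per_trial
print(total_fits)  # 20

# A trial's reported score is the mean objective across its executions.
execution_scores = [0.641, 0.651]  # hypothetical val_accuracy values
trial_score = round(sum(execution_scores) / len(execution_scores), 4)
print(trial_score)  # 0.646
```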

## Instantiate the tuner

tuner = kerastuner.tuners.RandomSearch(
  build_model,
  objective='val_accuracy',
  max_trials=10,
  executions_per_trial=2,
  directory='my_dir')
## Check the tuner's search space
tuner.search_space_summary()
Search space summary
Default search space size: 2
output_dim (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 128, 'step': 32, 'sampling': None}
units (Int)
{'default': None, 'conditions': [], 'min_value': 16, 'max_value': 64, 'step': 16, 'sampling': None}
%%time
## Start tuning with the tuner
tuner.search(X_train, y_train, validation_split=0.2, batch_size=128)
Trial 10 Complete [00h 00m 15s]
val_accuracy: 0.6420141458511353

Best val_accuracy So Far: 0.645948052406311
Total elapsed time: 00h 02m 29s
INFO:tensorflow:Oracle triggered exit
CPU times: user 3min 10s, sys: 4.9 s, total: 3min 15s
Wall time: 2min 29s
## Retrieve the best models from the tuner
models = tuner.get_best_models(num_models=2)
plot_model(models[0], show_shapes=True)
../_images/8-dl-chinese-name-gender_108_0.png
## Retrieve the summary of results from the tuner
tuner.results_summary()
Results summary
Results in my_dir/untitled_project
Showing 10 best trials
Objective(name='val_accuracy', direction='max')
Trial summary
Hyperparameters:
output_dim: 128
units: 48
Score: 0.645948052406311
Trial summary
Hyperparameters:
output_dim: 128
units: 32
Score: 0.6451612710952759
Trial summary
Hyperparameters:
output_dim: 128
units: 16
Score: 0.6435877084732056
Trial summary
Hyperparameters:
output_dim: 96
units: 48
Score: 0.643194317817688
Trial summary
Hyperparameters:
output_dim: 96
units: 32
Score: 0.6420141458511353
Trial summary
Hyperparameters:
output_dim: 64
units: 32
Score: 0.6412273645401001
Trial summary
Hyperparameters:
output_dim: 32
units: 64
Score: 0.6412273645401001
Trial summary
Hyperparameters:
output_dim: 32
units: 16
Score: 0.6412273645401001
Trial summary
Hyperparameters:
output_dim: 32
units: 48
Score: 0.6412273645401001
Trial summary
Hyperparameters:
output_dim: 32
units: 32
Score: 0.6412273645401001

Train Model with the Tuned Hyperparameters

EMBEDDING_DIM = 128
model6 = Sequential()
model6.add(Embedding(input_dim=vocab_size, 
                     output_dim=EMBEDDING_DIM, 
                     input_length=max_len, 
                     mask_zero=True))
model6.add(layers.Bidirectional(LSTM(64, activation="relu", name="lstm_layer", dropout=0.2, recurrent_dropout=0.2)))
model6.add(Dense(2, activation="softmax", name="output"))

model6.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"]
)
plot_model(model6)
../_images/8-dl-chinese-name-gender_112_0.png
history6 = model6.fit(X_train, y_train, 
                    batch_size=BATCH_SIZE, 
                    epochs=EPOCHS, verbose=2,
                    validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 7s - loss: 0.6451 - accuracy: 0.6176 - val_loss: 0.6015 - val_accuracy: 0.6467
Epoch 2/20
40/40 - 2s - loss: 0.5283 - accuracy: 0.7260 - val_loss: 0.4695 - val_accuracy: 0.7726
Epoch 3/20
40/40 - 2s - loss: 0.4704 - accuracy: 0.7817 - val_loss: 0.4577 - val_accuracy: 0.7852
Epoch 4/20
40/40 - 2s - loss: 0.4370 - accuracy: 0.7927 - val_loss: 0.4492 - val_accuracy: 0.7687
Epoch 5/20
40/40 - 2s - loss: 0.4294 - accuracy: 0.7933 - val_loss: 0.4391 - val_accuracy: 0.7899
Epoch 6/20
40/40 - 2s - loss: 0.4208 - accuracy: 0.8019 - val_loss: 0.4303 - val_accuracy: 0.7954
Epoch 7/20
40/40 - 2s - loss: 0.4178 - accuracy: 0.8013 - val_loss: 0.4278 - val_accuracy: 0.7994
Epoch 8/20
40/40 - 2s - loss: 0.4103 - accuracy: 0.8041 - val_loss: 0.4304 - val_accuracy: 0.8002
Epoch 9/20
40/40 - 2s - loss: 0.4089 - accuracy: 0.8059 - val_loss: 0.4210 - val_accuracy: 0.8002
Epoch 10/20
40/40 - 2s - loss: 0.4013 - accuracy: 0.8145 - val_loss: 0.4184 - val_accuracy: 0.7939
Epoch 11/20
40/40 - 2s - loss: 0.3989 - accuracy: 0.8139 - val_loss: 0.4161 - val_accuracy: 0.8002
Epoch 12/20
40/40 - 2s - loss: 0.3991 - accuracy: 0.8118 - val_loss: 0.4161 - val_accuracy: 0.7939
Epoch 13/20
40/40 - 2s - loss: 0.3908 - accuracy: 0.8153 - val_loss: 0.4149 - val_accuracy: 0.8009
Epoch 14/20
40/40 - 2s - loss: 0.3909 - accuracy: 0.8167 - val_loss: 0.4163 - val_accuracy: 0.7970
Epoch 15/20
40/40 - 2s - loss: 0.3835 - accuracy: 0.8194 - val_loss: 0.4159 - val_accuracy: 0.7915
Epoch 16/20
40/40 - 2s - loss: 0.3879 - accuracy: 0.8173 - val_loss: 0.4123 - val_accuracy: 0.8025
Epoch 17/20
40/40 - 2s - loss: 0.3783 - accuracy: 0.8236 - val_loss: 0.4101 - val_accuracy: 0.8009
Epoch 18/20
40/40 - 2s - loss: 0.3775 - accuracy: 0.8212 - val_loss: 0.4073 - val_accuracy: 0.7962
Epoch 19/20
40/40 - 2s - loss: 0.3772 - accuracy: 0.8251 - val_loss: 0.4123 - val_accuracy: 0.8025
Epoch 20/20
40/40 - 2s - loss: 0.3694 - accuracy: 0.8293 - val_loss: 0.4045 - val_accuracy: 0.8167
plot2(history6)
../_images/8-dl-chinese-name-gender_114_0.png

Interpret the Model

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=['Male'], char_level=True)
def model_predict_pipeline(text):
    _seq = tokenizer.texts_to_sequences(text)
    _seq_pad = keras.preprocessing.sequence.pad_sequences(_seq, maxlen=max_len)
    #return np.array([[float(1-x), float(x)] for x in model.predict(np.array(_seq_pad))])
    return model2.predict(np.array(_seq_pad))



# np.array(sequence.pad_sequences(
#     tokenizer.texts_to_sequences([n for (n,l) in test_set]),
#     maxlen = max_len)).astype('float32')
reversed_word_index = dict([(index, word) for (word, index) in tokenizer.word_index.items()])
text_id =305
X_test[text_id]
array([126, 112, 101], dtype=int32)
X_test_texts[text_id]
'潘少斌'
' '.join([reversed_word_index.get(i, '?') for i in X_test[text_id]])
'潘 少 斌'
model_predict_pipeline([X_test_texts[text_id]])
array([[1.]], dtype=float32)
exp = explainer.explain_instance(
X_test_texts[text_id], model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
y_test[text_id]
1
exp = explainer.explain_instance(
'陳宥欣', model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'李安芬', model_predict_pipeline, num_features=2, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'林月名', model_predict_pipeline, num_features=2, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'蔡英文', model_predict_pipeline, num_features=2, top_labels=1)
exp.show_in_notebook(text=True)

References

  • Chollet (2017), Ch 3 and Ch 4